19 research outputs found

    Direct Estimation of Differences in Causal Graphs

    We consider the problem of estimating the differences between two causal directed acyclic graph (DAG) models with a shared topological order given i.i.d. samples from each model. This is of interest, for example, in genomics, where changes in the structure or edge weights of the underlying causal graphs reflect alterations in the gene regulatory networks. We here provide the first provably consistent method for directly estimating the differences in a pair of causal DAGs without separately learning two possibly large and dense DAG models and computing their difference. Our two-step algorithm first uses invariance tests between regression coefficients of the two data sets to estimate the skeleton of the difference graph and then orients some of the edges using invariance tests between regression residual variances. We demonstrate the properties of our method through a simulation study and apply it to the analysis of gene expression data from ovarian cancer and during T-cell activation.
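
As a rough illustration of the first step described above, the sketch below checks whether a single regression coefficient is invariant across two data sets. It is a simplification under stated assumptions: it uses an ordinary least squares fit and a plain z-statistic on the coefficient difference, which is a stand-in and not necessarily the test used by the authors. The function names, the threshold, and the simulated data are all hypothetical.

```python
import numpy as np

def ols_coef_and_se(X, y):
    """Return OLS coefficients and their standard errors (no intercept)."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)          # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    return beta, se

def coefficient_changed(X1, y1, X2, y2, j, z_threshold=5.0):
    """Flag coefficient j as changed if the two estimates differ by more
    than z_threshold combined standard errors (illustrative threshold)."""
    b1, s1 = ols_coef_and_se(X1, y1)
    b2, s2 = ols_coef_and_se(X2, y2)
    z = (b1[j] - b2[j]) / np.sqrt(s1[j] ** 2 + s2[j] ** 2)
    return bool(abs(z) > z_threshold)

# Simulated data: the weight on the first regressor is the same in both
# conditions, the weight on the second regressor changes from 0.5 to 1.5.
rng = np.random.default_rng(0)
n = 2000
X1 = rng.normal(size=(n, 2))
X2 = rng.normal(size=(n, 2))
y1 = 1.0 * X1[:, 0] + 0.5 * X1[:, 1] + rng.normal(size=n)
y2 = 1.0 * X2[:, 0] + 1.5 * X2[:, 1] + rng.normal(size=n)

changed = [coefficient_changed(X1, y1, X2, y2, j) for j in range(2)]
print(changed)
```

In this toy setting only the second coefficient would be reported as a candidate edge in the difference graph. The orientation step based on residual variances is not shown.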

    Causal Network Models of SARS-CoV-2 Expression and Aging to Identify Candidates for Drug Repurposing

    Given the severity of the SARS-CoV-2 pandemic, a major challenge is to rapidly repurpose existing approved drugs for clinical interventions. While a number of data-driven and experimental approaches have been suggested in the context of drug repurposing, a platform that systematically integrates available transcriptomic, proteomic and structural data is missing. More importantly, given that SARS-CoV-2 pathogenicity is highly age-dependent, it is critical to integrate aging signatures into drug discovery platforms. We here take advantage of large-scale transcriptional drug screens combined with RNA-seq data of the lung epithelium with SARS-CoV-2 infection as well as the aging lung. To identify robust druggable protein targets, we propose a principled causal framework that makes use of multiple data modalities. Our analysis highlights the importance of serine/threonine and tyrosine kinases as potential targets that intersect the SARS-CoV-2 and aging pathways. By integrating transcriptomic, proteomic and structural data that is available for many diseases, our drug discovery platform is broadly applicable. Rigorous in vitro experiments as well as clinical trials are needed to validate the identified candidate drugs.

    Multimodal LLMs for health grounded in individual-specific data

    Foundation large language models (LLMs) have shown an impressive ability to solve tasks across a wide range of fields including health. To effectively solve personalized health tasks, LLMs need the ability to ingest a diversity of data modalities that are relevant to an individual's health status. In this paper, we take a step towards creating multimodal LLMs for health that are grounded in individual-specific data by developing a framework (HeLM: Health Large Language Model for Multimodal Understanding) that enables LLMs to use high-dimensional clinical modalities to estimate underlying disease risk. HeLM encodes complex data modalities by learning an encoder that maps them into the LLM's token embedding space and for simple modalities like tabular data by serializing the data into text. Using data from the UK Biobank, we show that HeLM can effectively use demographic and clinical features in addition to high-dimensional time-series data to estimate disease risk. For example, HeLM achieves an AUROC of 0.75 for asthma prediction when combining tabular and spirogram data modalities compared with 0.49 when only using tabular data. Overall, we find that HeLM outperforms or performs at parity with classical machine learning approaches across a selection of eight binary traits. Furthermore, we investigate the downstream uses of this model such as its generalizability to out-of-distribution traits and its ability to power conversations around individual health and wellness.

    Computational methods for analyzing and modeling gene regulation and 3D genome organization

    Thesis: Ph. D., Massachusetts Institute of Technology, Computational and Systems Biology Program, February, 2021. Cataloged from the official PDF of thesis. Includes bibliographical references (pages 261-281). Biological processes from differentiation to disease progression are governed by gene regulatory mechanisms. Currently large-scale omics and imaging data sets are being collected to characterize gene regulation at every level. Such data sets present new opportunities and challenges for extracting biological insights and elucidating the gene regulatory logic of cells. In this thesis, I present computational methods for the analysis and integration of various data types used for cell profiling. Specifically, I focus on analyzing and linking gene expression with the 3D organization of the genome. First, I describe methodologies for elucidating gene regulatory mechanisms by considering multiple data modalities. I design a computational framework for identifying colocalized and coregulated chromosome regions by integrating gene expression and epigenetic marks with 3D interactions using network analysis. Then, I provide a general framework for data integration using autoencoders and apply it for the integration and translation between gene expression and chromatin images of naive T-cells. Second, I describe methods for analyzing single modalities such as contact frequency data, which measures the spatial organization of the genome, and gene expression data. Given the important role of the 3D genome organization in gene regulation, I present a methodology for reconstructing the 3D diploid conformation of the genome from contact frequency data.
Given the ubiquity of gene expression data and the recent advances in single-cell RNA-sequencing technologies as well as the need for causal modeling of gene regulatory mechanisms, I then describe an algorithm as well as a software tool, difference causal inference (DCI), for learning causal gene regulatory networks from gene expression data. DCI addresses the problem of directly learning differences between causal gene regulatory networks given gene expression data from two related conditions. Finally, I shift my focus from basic biology to drug discovery. Given the current COVID-19 pandemic, I present a computational drug repurposing platform that enables the identification of FDA approved compounds for drug repurposing and investigation of potential causal drug mechanisms. This framework relies on identifying drugs that reverse the signature of the infection in the space learned by an autoencoder and then uses causal inference to identify putative drug mechanisms. By Anastasiya Belyaeva.

    DCI: learning causal differences between gene regulatory networks

    Summary: Designing interventions to control gene regulation necessitates modeling a gene regulatory network by a causal graph. Currently, large-scale gene expression datasets from different conditions, cell types, disease states, and developmental time points are being collected. However, application of classical causal inference algorithms to infer gene regulatory networks based on such data is still challenging, requiring high sample sizes and computational resources. Here, we describe an algorithm that efficiently learns the differences in gene regulatory mechanisms between different conditions. Our difference causal inference (DCI) algorithm infers changes (i.e. edges that appeared, disappeared, or changed weight) between two causal graphs given gene expression data from the two conditions. This algorithm is efficient in its use of samples and computation since it infers the differences between causal graphs directly without estimating each possibly large causal graph separately. We provide a user-friendly Python implementation of DCI and also enable the user to learn the most robust difference causal graph across different tuning parameters via stability selection. Finally, we show how to apply DCI to single-cell RNA-seq data from different conditions and cell states, and we also validate our algorithm by predicting the effects of interventions. Availability and implementation: Python package freely available at http://uhlerlab.github.io/causaldag/dci. Supplementary information: Supplementary data are available at Bioinformatics online.

    Pedagogical Value of Folklore Prosaic Genres in Intellectual and Moral Education of Future Teachers at the Foreign Language Lessons

    The article is devoted to the pedagogical value of folklore in the intellectual and moral education of future teachers in foreign language lessons. The authors analyze why folklore is a valuable source in teacher training. The article examines folklore genres that can be used successfully for educational purposes, as well as the works of well-known foreign and Russian scientists who made major contributions to the theoretical and practical study of folklore. The authors also reviewed the works of pedagogues who used Russian folklore genres in kindergartens and in elementary and secondary schools. The aim of the study is to demonstrate the pedagogical value of folklore prosaic genres in the intellectual and moral education of future teachers in foreign language lessons. Before analyzing this value, the authors held an opinion poll at the university, surveying 200 first- and second-year students at the institute of psychology and education. The questions concerned folklore of the native language: whether the students knew any cradle songs, whether they remembered tongue twisters that their parents had used to improve their pronunciation, whether they remembered fairy tales from their childhood, and whether they remembered counting-out rhymes. 47% of the future primary teachers could not remember cradle songs. 33% of students responded that they knew some cradle songs but could not sing them in full. 20% of students had not heard cradle songs from their parents. Students who took part in the questionnaire and the poll showed better results in retelling fairy tales of their native languages (Russian and Tatar): 88% of future infant school teachers could remember the content of fairy tales. They said that they had heard them from their parents and read them at primary school.
Just 5% of students could remember local legends and fables. 60 students (2 groups) training as primary school teachers and English language teachers were assigned to experimental groups, in which the authors evaluated the results of using folklore genres of the native and second languages in foreign language teaching. At the end of the experimental year, the students completed questionnaires. 100% of students liked learning a foreign language by means of both native and foreign folklore, and said that they would use folklore in their future work. 99% of students underlined the educational value of folklore in bringing up children and in the intellectual and moral education of the younger generation. 100% of the students who learned a foreign language by means of folklore genres highlighted that the knowledge, competence, and skills they acquired would be very useful in their future work with children at primary schools.

    Identifying 3D Genome Organization in Diploid Organisms via Euclidean Distance Geometry

    openaire: EC/H2020/748354/EU//NonnegativeRank. The spatial organization of the genome in the cell nucleus plays an important role for gene regulation, replication of the deoxyribonucleic acid (DNA), and genomic integrity. Through the development of chromosome conformation capture experiments (such as 3C, 4C, and Hi-C) it is now possible to obtain the contact frequencies of the DNA at the whole-genome level. In this paper, we study the problem of reconstructing the three-dimensional (3D) organization of the genome from such whole-genome contact frequencies. A standard approach is to transform the contact frequencies into noisy distance measurements and then apply semidefinite programming formulations to obtain the 3D configuration. However, neglected in such reconstructions is the fact that most eukaryotes including humans are diploid and therefore contain two copies of each genomic locus. We prove that the 3D organization of the DNA is not identifiable from the distance measurements derived from contact frequencies in diploid organisms. In fact, there are infinitely many solutions even in the noise-free setting. We then discuss various additional biologically relevant and experimentally measurable constraints (including distances between neighboring genomic loci and higher-order interactions) and prove identifiability under these conditions. Furthermore, we provide semidefinite programming formulations for computing the 3D embedding of the DNA with these additional constraints and show that we can recover the true 3D embedding with high accuracy from both noiseless and noisy measurements. Finally, we apply our algorithm to real pairwise and higher-order contact frequency data and show that we can recover known genome organization patterns. Peer reviewed.
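
The "standard approach" mentioned above (contact frequencies, then distances, then a 3D configuration) can be sketched for the simpler haploid, noise-free case. The sketch makes two assumptions that are not taken from the abstract: contact frequencies are converted to distances with a power law d = f^(-1/alpha) using a hypothetical exponent alpha, and classical multidimensional scaling stands in for the semidefinite programming formulations. It does not address the diploid ambiguity that the paper is about.

```python
import numpy as np

def contacts_to_distances(F, alpha=1.0):
    """Assumed power-law conversion d_ij = f_ij ** (-1 / alpha).
    The diagonal is set to zero; alpha is a hypothetical parameter."""
    D = np.zeros_like(F, dtype=float)
    off_diag = ~np.eye(F.shape[0], dtype=bool)
    D[off_diag] = F[off_diag] ** (-1.0 / alpha)
    return D

def classical_mds(D, dim=3):
    """Embed points in `dim` dimensions from a Euclidean distance matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    G = -0.5 * J @ (D ** 2) @ J                # Gram matrix of centered points
    vals, vecs = np.linalg.eigh(G)
    top = np.argsort(vals)[::-1][:dim]         # largest eigenvalues first
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

def pairwise_distances(P):
    return np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)

# Synthetic haploid example: 10 loci with known 3D positions.
rng = np.random.default_rng(0)
points = rng.normal(size=(10, 3))
true_D = pairwise_distances(points)

# Noise-free synthetic contact frequencies under the assumed power law.
off_diag = ~np.eye(10, dtype=bool)
F = np.ones_like(true_D)
F[off_diag] = true_D[off_diag] ** -1.0

D = contacts_to_distances(F, alpha=1.0)
embedding = classical_mds(D, dim=3)
max_error = np.abs(pairwise_distances(embedding) - true_D).max()
print(max_error)
```

The recovered embedding matches the original points only up to rotation, reflection, and translation, so the comparison is made on pairwise distances.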

    Network analysis identifies chromosome intermingling regions as regulatory hotspots for transcription

    The 3D structure of the genome plays a key role in regulatory control of the cell. Experimental methods such as high-throughput chromosome conformation capture (Hi-C) have been developed to probe the 3D structure of the genome. However, it remains a challenge to deduce from these data chromosome regions that are colocalized and coregulated. Here, we present an integrative approach that leverages 1D functional genomic features (e.g., epigenetic marks) with 3D interactions from Hi-C data to identify functional interchromosomal interactions. We construct a weighted network with 250-kb genomic regions as nodes and Hi-C interactions as edges, where the edge weights are given by the correlation between 1D genomic features. Individual interacting clusters are determined using weighted correlation clustering on the network. We show that intermingling regions generally fall into either active or inactive clusters based on the enrichment for RNA polymerase II (RNAPII) and H3K9me3, respectively. We show that active clusters are hotspots for transcription factor binding sites. We also validate our predictions experimentally by 3D fluorescence in situ hybridization (FISH) experiments and show that active RNAPII is enriched in predicted active clusters. Our method provides a general quantitative framework that couples 1D genomic features with 3D interactions from Hi-C to probe the guiding principles that link the spatial organization of the genome with regulatory control. Keywords: chromosome intermingling; Hi-C; network and clustering analysis; epigenetics; 3D FISH. National Institutes of Health (U.S.) (Grant T32GM87232); National Science Foundation (U.S.) (Grant 1122374); National Science Foundation (U.S.) (Grant 1651995); United States. Defense Advanced Research Projects Agency (Grant W911NF-16-1-0551); United States. Office of Naval Research (Grant N00014-17-1-2147)
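
The network construction described above can be sketched as follows. This is a minimal illustration, assuming Pearson correlation as the correlation measure and a hypothetical `min_contacts` cutoff for deciding which region pairs count as Hi-C interactions. The data are random placeholders, and the weighted correlation clustering step is not shown.

```python
import numpy as np

def build_weighted_network(features, hic_contacts, min_contacts=1):
    """Return a weight matrix over genomic regions.

    features:     (regions x features) array of 1D genomic features
    hic_contacts: (regions x regions) symmetric array of Hi-C contact counts
    An edge exists where the contact count reaches min_contacts (an assumed
    cutoff); its weight is the Pearson correlation of the two feature rows.
    """
    corr = np.corrcoef(features)               # region-by-region correlation
    adjacency = hic_contacts >= min_contacts
    np.fill_diagonal(adjacency, False)         # no self-loops
    return np.where(adjacency, corr, 0.0)

# Placeholder data: 6 regions, 4 features each, random symmetric contacts.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4))
upper = np.triu(rng.integers(0, 3, size=(6, 6)), k=1)
hic_contacts = upper + upper.T

weights = build_weighted_network(features, hic_contacts)
print(np.round(weights, 2))
```

Region pairs without a Hi-C interaction get weight zero, so the clustering would only see feature correlation where there is also evidence of spatial contact.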